8 research outputs found

    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012

    Towards the detection of cross-language source code reuse

    The Internet has made huge amounts of information available, including source code. Source code repositories and, more generally, programming-related websites facilitate its reuse. In this work, we propose a simple approach to the detection of cross-language source code reuse, a barely investigated problem. Our preliminary experiments, based on character n-gram comparison, show that considering different sections of the code (comments, code, reserved words, etc.) leads to different results. For three programming languages (C++, Java, and Python), the best result is obtained when comments are discarded and the entire source code is considered.
    This work has been developed with the support of the project TEXT-ENTERPRISE 2.0: Text comprehension techniques applied to the needs of the Enterprise 2.0 (MICINN, Spain TIN2009-13391-C04-03 (Plan I+D+i)).
    Flores Sáez, E.; Barrón Cedeño, LA.; Rosso, P.; Moreno Boronat, LA. (2011). Towards the detection of cross-language source code reuse. In Natural Language Processing and Information Systems. Springer Verlag (Germany). 6716:250-253. https://doi.org/10.1007/978-3-642-22327-3_31
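    The character n-gram comparison mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the trigram size, the cosine measure, the whitespace normalization, and the helper names (`strip_line_comments`, `char_ngrams`, `cosine_similarity`) are all assumptions chosen for illustration, and the comment stripping is deliberately naive.

```python
import math
import re
from collections import Counter


def strip_line_comments(code: str, marker: str) -> str:
    """Naively drop everything from `marker` to the end of each line.

    Assumption for illustration only: string literals and block
    comments are not handled.
    """
    return "\n".join(line.split(marker, 1)[0] for line in code.splitlines())


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after lowercasing and
    collapsing runs of whitespace into single spaces."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical snippets: the same loop written in Java and in Python.
java_code = (
    "int total = 0; // accumulate\n"
    "for (int i = 0; i < n; i++) { total += values[i]; }"
)
python_code = (
    "total = 0  # accumulate\n"
    "for i in range(n):\n"
    "    total += values[i]"
)
unrelated_code = "print('hello world')"

java_profile = char_ngrams(strip_line_comments(java_code, "//"))
python_profile = char_ngrams(strip_line_comments(python_code, "#"))
unrelated_profile = char_ngrams(unrelated_code)

reused_score = cosine_similarity(java_profile, python_profile)
unrelated_score = cosine_similarity(python_profile, unrelated_profile)
```

    In this toy setup, the translated loop shares identifiers and operators with its counterpart, so `reused_score` comes out higher than `unrelated_score`; real experiments would need a labelled corpus and a tuned threshold.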

    Overview of the 3rd International Competition on Plagiarism Detection

    [EN] This paper overviews eleven plagiarism detectors that have been developed and evaluated within PAN’11. We survey the detection approaches developed for the two sub-tasks “external plagiarism detection” and “intrinsic plagiarism detection,” and we report on their detailed evaluation based on the third revised edition of the PAN plagiarism corpus PAN-PC-11.
    This work was partly funded by the European Commission as part of the WIQEI IRSES project (grant no. 269180) within the FP7 Marie Curie People Framework, by MICINN as part of the TextEnterprise 2.0 project (TIN2009-13391-C04-03) within the Plan I+D+i, and as part of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
    Potthast, M.; Eiselt, A.; Barrón Cedeño, LA.; Stein, B.; Rosso, P. (2011). Overview of the 3rd International Competition on Plagiarism Detection. CEUR Workshop Proceedings. 1177. http://hdl.handle.net/10251/46639

    Cross-language source code re-use detection using latent semantic analysis

    [EN] Nowadays, the Internet is the main source of information: blogs, encyclopedias, discussion forums, source code repositories, and many more resources are just one click away. The temptation to re-use these materials is very high. Even source code is easily available through a simple Web search. There is a need to detect potential instances of source code re-use. Source code re-use detection has usually been approached by comparing source codes in their compiled version. When dealing with cross-language source code re-use, traditional approaches can deal only with the programming languages supported by the compiler. We assume that a source code is a piece of text, with its own syntax and structure, so we aim at applying models for free text re-use detection to source code. In this paper we compare a Latent Semantic Analysis (LSA) approach with previously used text re-use detection models for measuring cross-language similarity in source code. The LSA-based approach shows slightly better results than the other models, being able to distinguish between re-used and related source codes with high performance.
    This work was partially supported by Universitat Politècnica de València, WIQ-EI (IRSES grant n. 269180), and the DIANA-APPLICATIONS (TIN2012-38603-C02-01) project. The work of the fourth author is also supported by the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
    Flores Sáez, E.; Barrón-Cedeño, LA.; Moreno Boronat, LA.; Rosso, P. (2015). Cross-language source code re-use detection using latent semantic analysis. Journal of Universal Computer Science. 21(13):1708-1725. https://doi.org/10.3217/jucs-021-13-1708

    Extracting Parallel Corpora from Wikipedia on the basis of Phrase Level Bilingual Alignment

    [EN] This paper presents a proposal for extracting parallel corpora from Wikipedia on the basis of statistical machine translation techniques. We have used word-level alignment models from IBM in order to obtain phrase-level bilingual alignments between document pairs. We have manually annotated a test set of comparable English-Spanish documents in order to evaluate the model. The obtained results are encouraging.
    This work was carried out within the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems, partially funded by the EC (FEDER/FSE; WIQEI IRSES no. 269180 / FP 7 Marie Curie People), by MICINN as part of the Text-Enterprise 2.0 project (TIN2009-13391-C04-03) within the Plan I+D+i, and by CONACyT grant 192021. It was also supported by the EC (FEDER/FSE) and by MEC/MICINN under the MIPRCV “Consolider Ingenio 2010” programme (CSD2007-00018) and the iTrans2 project (TIN2009-14511), by MITyC within the erudito.com project (TSI-020110-2009-439), by the Generalitat Valenciana through grants Prometeo/2009/014 and GV/2010/067, and by the “Vicerrectorado de Investigación de la UPV” through grant 20091027.
    Silvestre Cerdà, JA.; Garcia Martinez, MM.; Barrón Cedeño, LA.; Civera Saiz, J.; Rosso, P. (2011). Extracción de Corpus Paralelos de la Wikipedia basada en la Obtención de Alineamientos Bilingües a Nivel de Frase. CEUR Workshop Proceedings. 824:14-21. http://hdl.handle.net/10251/27930

    EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020

    Welcome to EVALITA 2020! EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. EVALITA is an initiative of the Italian Association for Computational Linguistics (AILC, http://www.ai-lc.it) and is endorsed by the Italian Association for Artificial Intelligence (AIxIA, http://www.aixia.it) and the Italian Association for Speech Sciences (AISV, http://www.aisv.it).

    Detección automática de plagio en texto

    Text plagiarism means including in a document text written by another person without giving them credit. We tested several existing methods and developed two new ones for plagiarism detection: one for reducing the search space and another for detecting plagiarism from one language to another. These subtasks have hardly been addressed before.
    Barrón Cedeño, LA. (2008). Detección automática de plagio en texto. http://hdl.handle.net/10251/12186

    Overview of the 4th International Competition on Plagiarism Detection

    [EN] This paper overviews 15 plagiarism detectors that have been evaluated within the fourth international competition on plagiarism detection at PAN 12. We report on their performances for two sub-tasks of external plagiarism detection: candidate document retrieval and detailed document comparison. Furthermore, we introduce the PAN plagiarism corpus 2012, the TIRA experimentation platform, and the ChatNoir search engine for the ClueWeb. They add scale and realism to the evaluation as well as new means of measuring performance.
    This work was partly funded by the EC WIQ-EI project (project no. 269180) within the FP7 People Program, by the MICINN Text-Enterprise (TIN2009-13391-C04-03) research project, and by the ERCIM "Alain Bensoussan" Fellowship Programme (funded from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement number 246016).
    Potthast, M.; Gollub, T.; Hagen, M.; Graßegger, J.; Kiesel, J.; Michel, ML.; Oberländer, A.... (2012). Overview of the 4th International Competition on Plagiarism Detection. CLEF 2012 Evaluation Labs and Workshop – Working Notes Papers, 17-20 September. 101-128. http://hdl.handle.net/10251/55282